Biometrika Trust 



The Role of Experimental Randomization in Bayesian Statistics: Finite Sampling and Two 
Bayesians 

Author(s): M. Stone 

Source: Biometrika, Vol. 56, No. 3 (Dec, 1969), pp. 681-683 

Published by: Biometrika Trust 

Stable URL: http://www.jstor.org/stable/2334675 

Accessed: 31/08/2014 18:39 






This content downloaded from 151.228.195.179 on Sun, 31 Aug 2014 18:39:58 PM 
All use subject to JSTOR Terms and Conditions 



Biometrika (1969), 56, 3, p. 681 
Printed in Great Britain 






Miscellanea 

The role of experimental randomization in Bayesian statistics: 
finite sampling and two Bayesians 

By M. STONE 

University College London 

Summary 

A justification is developed for randomization in finite sampling, involving two Bayesians and the 
scientific community. 

It has been claimed that randomization can bring people into 'virtual unanimity', that randomization
makes 'data reasonably convincing to other people', that a randomly selected sample has 'increased
utility'. These and related claims are given by Savage (1954, p. 217; 1962, pp. 34, 87, 89), Kyburg &
Smokler (1963, p. 186) and Ericson (1969). They are commonly, if somewhat reluctantly, accepted by
Bayesians. Our present inability to invest them with a precise meaning has important practical
implications. One concerns the question of how much randomization is necessary to achieve its apparent
purposes. Another is whether the necessity of including randomization vitiates the current Bayesian
approach to questions of design.

Suppose that a Bayesian B is assigned the task of attribute sampling of a labelled population of size $N$.
Let $\theta_i = 1$ or $0$ according as the $i$th unit in the population does or does not have the attribute
involved. A sample of $n$ units is allowed and the parameter of interest is $\tau = \theta_1 + \ldots + \theta_N$.
In isolation, B would choose $n$ particular items, on the basis of his prior distribution
$\pi_B(\theta_1, \ldots, \theta_N)$, say, to maximize average information about $\tau$ or average utility of
estimate of $\tau$, as the case may be.

It is often argued that randomization is useful in this case as a safeguard against self-deception, that,
especially in those cases where $\pi_B$ leads to a sample peculiarly dependent on it, it is preferable for B to
opt for the kind of sample that typically results from randomization. We will not discuss this argument
further than to observe that it does not appear to constitute a substantial case for randomization per se,
when the data are to be analysed by the Bayesian method. Randomization may be no more than
a convenient method of achieving with high probability a sample that will be informative about
possibilities that $\pi_B$ might rate as most unlikely. In a sense, the latter objective is unreasonable.

The case for using randomization when B is not working in isolation is developed by introducing 
another Bayesian A. A is also interested in r but is obliged to use the data that results from B's sampling. 
We suppose that B would like A to be able to make use of his data in this way. 

If B communicates his prior $\pi_B$ to A, who presumably has his own prior $\pi_A$, the data for A is now in
fact the combination of the sample data and the communicated $\pi_B$. A complains that this puts him in an
impossible position. As a Bayesian, A would like to employ all this information but cannot see how to
incorporate $\pi_B$. If A believes B to be an honest experimenter, he is loath to ignore $\pi_B$ and to use the
sample data as if the sample had been selected by some noninformative sampling rule, with likelihood
function as described by Godambe (1966). If A believes that B may be dishonest, i.e. attempting to mislead
him, then even the option of ignoring $\pi_B$ and using just the sample data seems potentially dangerous. For B
may be concealing some prior information about $\theta_1, \ldots, \theta_N$ that in fact guided his choice of
sample in an effort to mislead A; the prior $\pi_B$ communicated may, in that case, be purely window-dressing
constructed to make the sample appear to be an honestly optimal one. That such construction is quite
feasible is demonstrated in the Appendix. With this in mind, A is now unwilling to use even the sample
data, although he may believe B to be honest. This is because he does not want to lay himself open to the
charge of naivety from other scientists, namely, that he would be acting in such a way that he could have
been misled by B had B wished to do so.

However, if B selects his $n$ units by simple random sampling and A knows this, A's problems are
resolved. A then can and should use his own prior $\pi_A$ in conjunction with the above-mentioned likelihood
function. Note that simple random sampling is the only sampling scheme that performs this resolution
perfectly. If B used a random sampling scheme specified by probabilities $\{P(S) \mid S \text{ any sample of size } n\}$
with $P(S)$ dependent on $S$, A would be back in a position of uncertainty about how to deal with the
information in the different $P(S)$ values, no matter how close these values were to constancy. Such is the
argument for simple random sampling, requested by A and understood by B.

However, B may well feel that simple random sampling is too great a concession to A's request for
randomization in the interests of communication. Suppose B has his own reliable prior information,
reflected in $\pi_B$, about the existence of two strata $\mathcal{S}_1$ and $\mathcal{S}_2$ constituting the
population and such that the values of $\theta$ within the strata are relatively homogeneous. To utilize this
information, B proposes stratified random sampling. How will A react? Consider the case where A's prior
distribution is the mixed binomial

$\pi_A(\theta_1, \ldots, \theta_N) = \int \alpha^{\tau} (1-\alpha)^{N-\tau} \, dF(\alpha).$ (1)

If A ignored $\pi_B$, in particular, the stratification component of it, and used $\pi_A$ with the sample
likelihood, his inference would be manipulable by B. For B would merely have to vary the allocation of the
sampling between $\mathcal{S}_1$ and $\mathcal{S}_2$ to bias A's inference in whichever direction he wanted.
This is an illustration of the point above; A's method of inference could be criticized as naive and
unscientific. What can A do if he does not know how to change $\pi_A$ to accommodate the possibly positive
information in $\pi_B$ and the negative possibility that B is dishonest?
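
The manipulation can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that $F$ in (1) is uniform on (0, 1), so that after observing $t$ attributes among $n$ sampled units A's posterior mean of $\tau$ is $t + (N-n)(t+1)/(n+2)$. The stratum sizes, stratum totals and function names are hypothetical and do not come from the paper.

```python
def pooled_posterior_mean(t, n, N):
    """A's posterior mean of tau under prior (1), assuming F is uniform on (0, 1).

    Observed units contribute t; each of the N - n unobserved units has
    posterior probability (t + 1) / (n + 2) of carrying the attribute.
    """
    return t + (N - n) * (t + 1) / (n + 2)


def expected_pooled_mean(n1, n2, N1, N2, tau1, tau2):
    """Expectation of A's posterior mean when B draws n1 and n2 units at random
    from strata holding tau1 of N1 and tau2 of N2 attribute units.

    The posterior mean is linear in t, so it suffices to plug in E[t].
    """
    expected_t = n1 * tau1 / N1 + n2 * tau2 / N2
    return pooled_posterior_mean(expected_t, n1 + n2, N1 + N2)


# Hypothetical population: two strata of 500 units, true tau = 400 + 100 = 500.
N1, N2 = 500, 500
tau1, tau2 = 400, 100
n = 100

# B varies the allocation n1 (the remainder goes to the second stratum).
results = {
    n1: expected_pooled_mean(n1, n - n1, N1, N2, tau1, tau2)
    for n1 in (0, 50, 100)
}
for n1, value in results.items():
    print(f"n1 = {n1:3d}, n2 = {n - n1:3d}: expected posterior mean = {value:.1f}")
```

Under these assumptions only the proportional allocation centres A's inference on the true total. Sampling entirely from one stratum pulls the pooled estimate far below or far above it, although every unit is drawn at random within its stratum.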

The reason A's inference is manipulable by B, if he uses $\pi_A$ given by (1), is that (1) insists on the same
value of $\alpha$ for both $\mathcal{S}_1$ and $\mathcal{S}_2$. That is, there are very strong connexions in A's
prior distribution which can be exploited by B. On the other hand A would be safeguarded by taking account of
B's stratification to the extent of adopting a prior distribution in which the values of $\theta$ in
$\mathcal{S}_1$ are relatively independent of the values of $\theta$ in $\mathcal{S}_2$. For example, the choice



(2) 



where $N_1$ and $N_2$ are the sizes of the two strata and

$\tau_i = \sum_{j \in \mathcal{S}_i} \theta_j \quad (i = 1, 2),$

would achieve this, in a way relatively consistent with (1). The combination of (2) with the sample data 
is now immediate in view of the noninformative nature of the random sampling within each stratum. 
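
A prior of the kind described can be written out as follows. This is offered as an assumption about the intended form, not as the author's own expression: it applies the mixed binomial of (1) separately and independently within each stratum, using the same mixing distribution $F$.

```latex
\pi_A(\theta_1, \ldots, \theta_N)
  = \prod_{i=1}^{2} \int \alpha^{\tau_i} (1-\alpha)^{N_i - \tau_i} \, dF(\alpha),
\qquad
\tau_i = \sum_{j \in \mathcal{S}_i} \theta_j .
```

With this form the two strata share no common $\alpha$, so observations in one stratum carry no information about the other, which is the independence property the argument requires.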

The protection afforded to A by the randomization within strata is illustrated by the case when $N_1$
and $N_2$ are very large and large samples of sizes $n_1$ and $n_2$ are taken with $n_1 + n_2 = n$. If
$F(\alpha)$ is smooth then, if $t_1$ and $t_2$ denote the totals of observed values of $\theta$ in
$\mathcal{S}_1$ and $\mathcal{S}_2$ respectively, the posterior distribution for $\tau$ has approximate mean
$N_1(t_1/n_1) + N_2(t_2/n_2)$. Under random sampling of each stratum, this posterior mean has expectation
approximately $\tau$. Thus B's manipulative power is roughly limited to control of the precision of A's
inference and is ineffective in controlling its direction.
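
The claim about direction and precision can be checked by exact enumeration on a toy population. The sketch below is illustrative only: the two small strata are hypothetical, and it examines the quantity $N_1(t_1/n_1) + N_2(t_2/n_2)$ directly instead of a full posterior. It averages that quantity over every equally likely within-stratum sample for several allocations.

```python
from fractions import Fraction
from itertools import combinations


def stratified_estimate_moments(stratum1, stratum2, n1, n2):
    """Exact mean and variance of N1*t1/n1 + N2*t2/n2 over all equally likely
    simple random samples of sizes n1 and n2 drawn within the two strata."""
    N1, N2 = len(stratum1), len(stratum2)
    estimates = [
        Fraction(N1 * sum(s1), n1) + Fraction(N2 * sum(s2), n2)
        for s1 in combinations(stratum1, n1)
        for s2 in combinations(stratum2, n2)
    ]
    mean = sum(estimates) / len(estimates)
    variance = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return mean, variance


# Hypothetical strata of 0/1 attribute values; true tau = 3 + 1 = 4.
stratum1 = [1, 1, 1, 0]
stratum2 = [0, 0, 1, 0, 0, 0]
tau = sum(stratum1) + sum(stratum2)

# B chooses the allocation (n1, n2); the total sample size is 4 throughout.
moments = {
    (n1, n2): stratified_estimate_moments(stratum1, stratum2, n1, n2)
    for (n1, n2) in [(1, 3), (2, 2), (3, 1)]
}
for allocation, (mean, variance) in moments.items():
    print(allocation, "mean =", mean, "variance =", variance)
```

In this toy case every allocation gives a mean equal to the true total, while the variance changes with the allocation, which matches the statement that B can influence precision but not direction.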



References 

Ericson, W. A. (1969). Subjective Bayesian models in sampling finite populations. J. R. Statist. Soc.
B 31, to appear.
Godambe, V. P. (1966). A new approach to sampling from finite populations. I. Sufficiency and linear
estimation. J. R. Statist. Soc. B 28, 310-19.
Kyburg, H. E. & Smokler, H. E. (1963). Studies in Subjective Probability. New York: Wiley.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math.
Statist. 27, 986-1005.
Savage, L. J. (1954). The Foundations of Statistics. New York: Wiley.
Savage, L. J. (1962). The Foundations of Statistical Inference. London: Methuen.



Appendix 

Theorem 1. Given the prior probabilities $\pi_B(\theta_1, \ldots, \theta_N)$, the $n$ units that, when selected
and observed, give the maximum average gain of Shannon-Hartley information about $\theta_1, \ldots, \theta_N$
(Lindley, 1956) have indices $i_1, \ldots, i_n$ maximizing

$-\sum \pi(\theta_{i_1}, \ldots, \theta_{i_n}) \log \pi(\theta_{i_1}, \ldots, \theta_{i_n}),$

where $\pi(\theta_{i_1}, \ldots, \theta_{i_n})$ are the marginal prior probabilities for
$\theta_{i_1}, \ldots, \theta_{i_n}$.






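Proof. Suppose that units $i_1, \ldots, i_n$ are selected and $\theta^0_{i_1}, \ldots, \theta^0_{i_n}$ are the
observed values of $\theta$. The posterior probability of $\theta$ is

$\pi_B(\theta) / \pi(\theta^0_{i_1}, \ldots, \theta^0_{i_n})$ if $\theta_{i_r} = \theta^0_{i_r}$ $(r = 1, \ldots, n)$, and $0$ otherwise.

Thus the average posterior entropy of $\theta$ is the prior entropy less the entropy of the marginal
distribution of $\theta_{i_1}, \ldots, \theta_{i_n}$, whence the result stated.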
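
Theorem 1 reduces the choice of units to a search over index sets for the largest marginal entropy. A minimal brute-force sketch follows. The representation of the prior as a dictionary from 0/1 tuples to probabilities, the 0-based indices, the function names and the three-unit example are all assumptions made for illustration.

```python
from itertools import combinations
from math import log


def marginal_entropy(prior, indices):
    """Entropy of the marginal prior distribution of the units in `indices`.

    `prior` maps each tuple (theta_1, ..., theta_N) to its prior probability.
    """
    marginal = {}
    for theta, p in prior.items():
        key = tuple(theta[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log(p) for p in marginal.values() if p > 0)


def most_informative_units(prior, n):
    """Index set of size n (0-based) with the largest marginal entropy."""
    N = len(next(iter(prior)))
    return max(
        combinations(range(N), n),
        key=lambda indices: marginal_entropy(prior, indices),
    )


# Hypothetical prior on three units: units 0 and 1 always agree,
# unit 2 is an independent fair coin.
prior = {(a, a, c): 0.25 for a in (0, 1) for c in (0, 1)}

best = most_informative_units(prior, 2)
print(best, marginal_entropy(prior, best))
```

In this example observing both of the perfectly dependent units wastes an observation, so the selected pair always includes the independent unit.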

Theorem 2. Suppose that, by $\pi_B(\theta_1, \ldots, \theta_N)$, $\theta_1, \ldots, \theta_N$ have independent
prior distributions with means $\mu_1, \ldots, \mu_N$. The choice of $n$ units that minimizes the posterior
variance of $\tau$ consists of those $n$ units with $\mu$-values nearest $\tfrac{1}{2}$.

Proof. If $j_1, \ldots, j_{N-n}$ are the indices of the unobserved units, the posterior variance of $\tau$ is
the sum of the prior variances of $\theta_{j_1}, \ldots, \theta_{j_{N-n}}$. The prior variance of $\theta_r$ is
$\tfrac{1}{4} - (\mu_r - \tfrac{1}{2})^2$.
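
The rule in Theorem 2 can be compared with an exhaustive search on a small example. The sketch below is illustrative only: the prior means and function names are hypothetical and indices are 0-based. It uses the variance expression from the proof.

```python
from itertools import combinations


def prior_variance(mu):
    """Prior variance of a 0/1 variable with mean mu: 1/4 - (mu - 1/2)^2."""
    return 0.25 - (mu - 0.5) ** 2


def units_nearest_half(mus, n):
    """The n indices whose prior means lie nearest 1/2 (the rule of Theorem 2)."""
    order = sorted(range(len(mus)), key=lambda r: abs(mus[r] - 0.5))
    return sorted(order[:n])


def units_by_exhaustive_search(mus, n):
    """The n indices minimizing the summed prior variance of the unobserved units."""
    N = len(mus)

    def posterior_variance(sample):
        return sum(prior_variance(mus[r]) for r in range(N) if r not in sample)

    return sorted(min(combinations(range(N), n), key=posterior_variance))


# Hypothetical prior means for five units.
mus = [0.05, 0.45, 0.9, 0.6, 0.3]
print(units_nearest_half(mus, 2), units_by_exhaustive_search(mus, 2))
```

Both functions select the units with means 0.45 and 0.6, the two whose outcomes are least predictable under the prior.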

From an ethical viewpoint, these theorems are deficient. The informational criterion in Theorem 1 is 
concerned with $\tau$ only indirectly and neglects the metric with sometimes unattractive consequences;
the assumption of independence in the $\pi_B$ of Theorem 2 is usually unrealistic. However, they do show
how easily the sampler B intent on deception can make his choice of units appear honestly optimal. 

[Received May 1969. Revised July 1969] 



A note on the estimation of variance components by the method of fitting constants 

By E. P. CUNNINGHAM 

The Agricultural Institute, Castleknock, Co. Dublin, Ireland, and Institute of
Animal Genetics and Breeding, Vollebekk, Norway

Summary 

A simple formula is given connected with the estimation, from unbalanced data, of an upper variance 
component. 

Let

$y = X\alpha + Z\beta + \epsilon$

be a general multiple classification model in which $y$ is a vector of observations, $X$ and $Z$ are incidence
matrices of ones and zeros, $\alpha$ is a vector of parameters representing the levels of a particular
classification, $\beta$ is another vector of parameters representing the remaining classifications, and perhaps
their interactions, and $\epsilon$ is a vector of random errors with variance $\sigma_e^2$. If the levels of the
$\alpha$ classification have random effects with a variance $\sigma_\alpha^2$ and do not interact with the
remainder of the model, then the mean square for $\alpha$ has expectation $k\sigma_\alpha^2 + \sigma_e^2$.
Together with the error mean square it can be used to estimate the two variance components.

In nonorthogonal data, this mean square can be calculated as $\hat{\alpha}' W^{-1} \hat{\alpha}/(q-1)$, where
$\hat{\alpha}$ is a vector of estimates obtained by solving the complete least squares equations, $W$ is a
$(q-1) \times (q-1)$ symmetric submatrix of the inverse of the complete least squares coefficient matrix
corresponding to the equations for $\alpha$, and $q-1$ is the degrees of freedom for the $\alpha$
classification. If the restriction $\hat{\alpha}_1 + \ldots + \hat{\alpha}_q = 0$ has been used in solving the
least squares equations, then the expected value of this mean square is

$E\left\{ \frac{1}{q-1} \hat{\alpha}' W^{-1} \hat{\alpha} \right\} = \frac{1}{q-1} \left\{ \mathrm{tr}(W^{-1}) - \frac{1}{q}\, \mathrm{sm}(W^{-1}) \right\} \sigma_\alpha^2 + \sigma_e^2.$ (1)

The operators tr and sm are the trace and sum of the elements of the matrix. 
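
As a numerical illustration, the coefficient of $\sigma_\alpha^2$ in the expectation above can be computed from $W^{-1}$ alone. The sketch below assumes that coefficient is $\{\mathrm{tr}(W^{-1}) - \mathrm{sm}(W^{-1})/q\}/(q-1)$. The function name and the nested-list representation of the matrix are hypothetical. The check uses a balanced one-way layout with $q = 4$ levels and 5 observations per level, for which $W^{-1} = 5(I + J)$ under the sum-to-zero restriction and the coefficient should equal 5.

```python
def variance_component_coefficient(w_inv):
    """Coefficient k of sigma_alpha^2 in the expected mean square,
    k = {tr(W^-1) - sm(W^-1) / q} / (q - 1),
    assuming the sum-to-zero restriction on the estimates.

    `w_inv` is the (q - 1) x (q - 1) matrix W^-1 given as nested lists.
    """
    q = len(w_inv) + 1
    trace = sum(w_inv[i][i] for i in range(q - 1))   # tr: sum of diagonal elements
    total = sum(sum(row) for row in w_inv)           # sm: sum of all elements
    return (trace - total / q) / (q - 1)


# Balanced one-way check: q = 4 levels, 5 observations per level,
# so W^-1 = 5 * (I + J) with I and J of order q - 1 = 3.
n_per_level = 5
w_inv = [
    [n_per_level * (2 if i == j else 1) for j in range(3)]
    for i in range(3)
]
k = variance_component_coefficient(w_inv)
print(k)
```

The balanced case recovers the familiar coefficient equal to the number of observations per level, which is a useful sanity check before applying the same computation to unbalanced data.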

These results are comparatively well known (Harvey, 1960), but the expectation for the case where the
equally valid and occasionally preferable (Healy, 1968) restriction $\hat{\alpha}_q = 0$ is used does not
appear to have been given. It can be derived as follows.






